# Zero-shot Inference
## Devstral Small Vision 2505 GGUF
Devstral Small with a vision encoder based on the Mistral Small model; supports image-text generation tasks and is compatible with the llama.cpp framework.
Image-to-Text · Apache-2.0 · by ngxson · 777 downloads · 20 likes

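A minimal sketch of loading a GGUF vision model through llama-cpp-python (the Python bindings for llama.cpp). The file names are placeholders, and whether the LLaVA-style chat handler matches this model's vision projector is an assumption to verify against the model card.

```python
# Hedged sketch: paths are placeholders; confirm the projector/handler
# pairing on the model card before relying on this.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-f16.gguf")  # vision projector file
llm = Llama(
    model_path="devstral-small-vision-2505-Q4_K_M.gguf",  # hypothetical quant filename
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for image tokens plus the text prompt
)
result = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "file:///path/to/image.png"}},
        {"type": "text", "text": "Describe this image."},
    ],
}])
print(result["choices"][0]["message"]["content"])
```
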
## Google.medgemma 4b It GGUF
MedGemma-4B-IT is a medical-focused image-to-text generation model developed by Google.
Image-to-Text · by DevQuasar · 6,609 downloads · 1 like

## VL Rethinker 7B 8bit
VL-Rethinker-7B-8bit is a multimodal model based on Qwen2.5-VL-7B-Instruct, supporting visual question answering tasks.
Image-to-Text · Transformers · English · Apache-2.0 · by mlx-community · 21 downloads · 0 likes

## VL Rethinker 7B Fp16
A multimodal vision-language model converted from Qwen2.5-VL-7B-Instruct, supporting visual question answering tasks.
Image-to-Text · Transformers · English · Apache-2.0 · by mlx-community · 17 downloads · 0 likes

## Qwen2.5 VL 32B Instruct GGUF
Qwen2.5-VL-32B-Instruct is a multimodal vision-language model that supports joint understanding and generation tasks over images and text.
Image-to-Text · English · Apache-2.0 · by samgreen · 25.59k downloads · 6 likes

## Qwen2.5 VL 72B Instruct GGUF
Qwen2.5-VL-72B-Instruct is a multimodal vision-language model that supports interactive generation tasks involving images and text.
Image-to-Text · English · Other license · by samgreen · 2,073 downloads · 1 like

## ARPG
ARPG is an autoregressive image generation framework that achieves BERT-style masked modeling within a GPT-style causal architecture.
Image Generation · MIT · by hp-l33 · 68 downloads · 2 likes

## Qwen2.5 14B CIC ACLARC
A citation intent classification model fine-tuned from Qwen 2.5 14B Instruct, designed specifically for classifying citation intent in scientific publications.
Text Classification · Transformers · English · Apache-2.0 · by sknow-lab · 24 downloads · 2 likes

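Since the entry names the base model and task, a usage sketch may help; the repo id and prompt template below are assumptions, and the authors' exact training template should be taken from the model card.

```python
# Hedged sketch: repo id and prompt wording are assumptions, not confirmed.
from transformers import pipeline

classifier = pipeline("text-generation", model="sknow-lab/Qwen2.5-14B-CIC-ACLARC")
citation = "We adopt the training recipe of Smith et al. (2020)."
prompt = f"Classify the intent of this citation: {citation}\nIntent:"
print(classifier(prompt, max_new_tokens=10)[0]["generated_text"])
```
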
## Eagle2 1B
Eagle 2 is a family of high-performance vision-language models that emphasizes transparency in its data strategy and training recipes, aiming to help the open-source community develop competitive vision-language models.
Image-to-Text · Transformers · Other · by nvidia · 1,791 downloads · 23 likes

## Aim Xlarge
AiM is an unconditional image generation model based on PyTorch, published to the Hugging Face Hub via PytorchModelHubMixin.
Image Generation · MIT · by hp-l33 · 23 downloads · 5 likes

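The PytorchModelHubMixin mentioned above is a generic huggingface_hub mechanism, so a toy sketch can show how it works; the class below is a stand-in, not AiM's actual architecture.

```python
# Toy sketch of PyTorchModelHubMixin: a plain nn.Module gains
# save_pretrained / from_pretrained / push_to_hub by inheriting the mixin.
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

class TinyGenerator(nn.Module, PyTorchModelHubMixin):
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.net = nn.Linear(latent_dim, 3 * 8 * 8)  # toy "image" decoder

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).view(-1, 3, 8, 8)

model = TinyGenerator()
model.save_pretrained("tiny-generator")              # writes config.json + weights
restored = TinyGenerator.from_pretrained("tiny-generator")
# model.push_to_hub("your-username/tiny-generator")  # optional: upload to the Hub
z = torch.randn(1, 16)
print(restored(z).shape)  # torch.Size([1, 3, 8, 8])
```
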
## Minicpm Llama3 V 2 5 GGUF
MiniCPM-Llama3-V-2_5 is a multimodal visual question answering model based on the Llama3 architecture, supporting interaction in both Chinese and English.
Image-to-Text · Supports Multiple Languages · by gaianet · 112 downloads · 3 likes

## Depth Anything V2 Metric Indoor Large Hf
A fine-tuned version of Depth Anything V2 for indoor metric depth estimation, trained on the synthetic Hypersim dataset and compatible with the transformers library.
3D Vision · Transformers · by depth-anything · 47.99k downloads · 9 likes

## Depth Anything V2 Metric Indoor Base Hf
A version of Depth Anything V2 fine-tuned for indoor metric depth estimation on the synthetic Hypersim dataset.
3D Vision · Transformers · by depth-anything · 9,056 downloads · 1 like

## Depth Anything V2 Metric Indoor Small Hf
A model fine-tuned from Depth Anything V2 for indoor metric depth estimation, trained on the synthetic Hypersim dataset and compatible with the transformers library.
3D Vision · Transformers · by depth-anything · 750 downloads · 2 likes

## Depth Anything V2 Metric Outdoor Small Hf
A fine-tuned version of Depth Anything V2 for metric depth estimation in outdoor scenes, trained on the synthetic Virtual KITTI dataset.
3D Vision · Transformers · by depth-anything · 459 downloads · 1 like

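All four Depth Anything V2 variants above share the same transformers interface, so one sketch covers them; the repo id follows the naming visible here and should be checked against the depth-anything organization page.

```python
# Hedged sketch: repo id assumed from the entry names above.
from PIL import Image
from transformers import pipeline

depth = pipeline("depth-estimation",
                 model="depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf")
image = Image.open("room.jpg")          # placeholder path
result = depth(image)
result["depth"].save("room_depth.png")  # depth map rendered as a PIL image
print(result["predicted_depth"].shape)  # raw per-pixel depth tensor
```
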
## Chronos T5 Base
Chronos is a family of pretrained time series forecasting models based on language model architectures: time series are transformed into token sequences via scaling and quantization, and the models are trained on those sequences.
Time Series Forecasting · Transformers · Apache-2.0 · by autogluon · 82.42k downloads · 5 likes

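The quantize-and-forecast recipe in the description maps to a few lines with the chronos-forecasting package; the repo id is assumed to match the entry above.

```python
# Hedged sketch: pip install chronos-forecasting; repo id assumed.
import torch
from chronos import ChronosPipeline

pipe = ChronosPipeline.from_pretrained("autogluon/chronos-t5-base")
context = torch.tensor([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0])
forecast = pipe.predict(context, prediction_length=4)  # [series, samples, horizon]
median = forecast[0].quantile(0.5, dim=0)  # median across sampled trajectories
print(median)
```
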
## Blip2 Test
BLIP-2 is a vision-language model based on OPT-2.7b that performs image-to-text generation by freezing the image encoder and the large language model while training a querying transformer (Q-Former).
Image-to-Text · Transformers · English · MIT · by advaitadasein · 18 downloads · 0 likes

## Blip2 Flan T5 Xxl
BLIP-2 is a vision-language model that combines an image encoder with the Flan T5-xxl large language model for image-to-text tasks.
Image-to-Text · Transformers · English · MIT · by Salesforce · 6,419 downloads · 88 likes

## Blip2 Flan T5 Xl
BLIP-2 is a vision-language model based on Flan T5-xl, pretrained with a frozen image encoder and frozen large language model; it supports tasks such as image captioning and visual question answering.
Image-to-Text · Transformers · English · MIT · by Salesforce · 91.77k downloads · 68 likes

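Both Salesforce BLIP-2 checkpoints above load through the standard transformers BLIP-2 classes; a minimal visual question answering sketch (the image path is a placeholder):

```python
# Minimal BLIP-2 VQA sketch.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.float16, device_map="auto"
)
image = Image.open("photo.jpg")  # placeholder path
inputs = processor(images=image,
                   text="Question: what is in the photo? Answer:",
                   return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
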
## Flan T5 Xl
FLAN-T5 XL is an instruction-finetuned language model based on the T5 architecture; fine-tuning on more than 1,000 tasks markedly improves its multilingual and few-shot performance.
Large Language Model · Supports Multiple Languages · Apache-2.0 · by google · 257.40k downloads · 494 likes

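Instruction-finetuned T5 models need no task-specific head; a minimal zero-shot sketch using the repo id from the entry above:

```python
# Minimal FLAN-T5 XL sketch: zero-shot instruction following.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl", device_map="auto")
inputs = tokenizer("Translate to German: How old are you?",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
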
## Gpt2 Question Answering Squad2
A question-answering model based on the GPT-2 architecture, fine-tuned on the SQuAD2 dataset to answer questions about a given text.
Question Answering System · Transformers · by danyaljj · 16 downloads · 2 likes

## Monot5 Base Msmarco
A re-ranking model based on the T5-base architecture, fine-tuned for 100,000 steps on the MS MARCO passage dataset; suited to document re-ranking tasks in information retrieval.
Large Language Model · by castorini · 7,405 downloads · 11 likes

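monoT5 scores a query-document pair by reading "Query: ... Document: ... Relevant:" and comparing the probabilities of generating "true" versus "false" as the first output token. A sketch of that recipe follows; the template is taken from the published monoT5 setup and is worth verifying against the model card.

```python
# Hedged sketch of monoT5 relevance scoring.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")
true_id = tokenizer.encode("true")[0]    # first token id of "true"
false_id = tokenizer.encode("false")[0]  # first token id of "false"

def relevance(query: str, document: str) -> float:
    text = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    probs = torch.softmax(logits[[false_id, true_id]], dim=0)
    return probs[1].item()  # probability mass on "true"

docs = ["Paris is the capital of France.", "Bananas are rich in potassium."]
ranked = sorted(docs, key=lambda d: relevance("capital of France", d), reverse=True)
print(ranked[0])
```
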
## Deberta V3 Base Mnli
DeBERTa-v3 model trained on the MultiNLI dataset for natural language inference tasks, excelling in zero-shot classification scenarios.
Text Classification · Transformers · English · by MoritzLaurer · 14.53k downloads · 6 likes

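This entry matches the page's zero-shot theme directly; a minimal sketch with the transformers zero-shot classification pipeline (repo id assumed from the entry, check the author's page for the exact name):

```python
# Hedged sketch: NLI-based zero-shot classification.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="MoritzLaurer/DeBERTa-v3-base-mnli")
result = classifier(
    "The new GPU doubles throughput at the same power draw.",
    candidate_labels=["hardware", "politics", "cooking"],
)
print(result["labels"][0], result["scores"][0])
```
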
## Zeroaraelectra
A zero-shot classification model for Arabic, supporting natural language inference tasks.
Text Classification · Transformers · Supports Multiple Languages · Other license · by KheireddineDaouadi · 39 downloads · 0 likes